Lab 2 Text Analysis

Advanced Analytic Techniques

Sofia Zaidman

3/27/23

Beatles Lyrics Text Analysis

I am using a dataset that contains Beatles song lyrics for songs from 13 albums. I found the dataset on a public GitHub page. I will compare lyrics primarily grouped by album.

Importing data, cleaning and creating bag of words:

Import Beatles lyrics dataset, remove two songs that have no lyrics.

Functions to remove punctuation and stopwords
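As a rough illustration of what these cleaning functions might look like, here is a minimal standard-library sketch. The function names and the tiny stopword list are my own assumptions for illustration; the list actually used in this lab is not shown in the text.

```python
import string

# Assumption: a tiny illustrative stopword list, not the list used in the lab.
STOPWORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "it", "you", "i", "me", "my"}

def remove_punctuation(text):
    """Lowercase the text and replace ASCII punctuation with spaces."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.lower().translate(table)

def remove_stopwords(text, stopwords=STOPWORDS):
    """Drop stopwords and return the remaining tokens joined by single spaces."""
    return " ".join(tok for tok in text.split() if tok not in stopwords)

def clean_lyrics(text):
    """Apply both cleaning steps in order."""
    return remove_stopwords(remove_punctuation(text))

print(clean_lyrics("Love, love me do! You know I love you."))
```

Replacing apostrophes with spaces splits contractions into fragments ("can't" becomes "can" and "t"), which is one way stray contraction pieces can end up in a bag of words.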

Clean text and create new column

The dataset has 184 songs

Lyrics could be grouped by "EARLY_LATE" (whether the song album was from the Beatles early or late period), composer, year, or album. I am going to analyze lyric differences between albums.

Function to get count-vectorized or tf-idf matrix in dataframe form (converting to "bag of words")

Aggregating words by album and vectorizing

Dropping some commonly occurring "words" that are really just exclamations or parts of contractions

Transpose dataframe so columns are each album, rows are words and values are word counts
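The aggregation, junk-token removal and word-by-album layout described above can be sketched without any libraries. This is only a toy version: the album names, lyrics and junk-token set below are made-up placeholders, not rows from the real dataset.

```python
from collections import Counter, defaultdict

# Hypothetical (album, cleaned lyrics) rows standing in for the real dataset.
songs = [
    ("Album A", "love love do know love"),
    ("Album A", "please love ooh"),
    ("Album B", "yellow submarine yellow ooh"),
]

# Assumption: example junk tokens; the actual drop list is not shown in the text.
JUNK = {"ooh", "t", "ll"}

# Pool the tokens of every song on the same album into one counter per album.
album_counts = defaultdict(Counter)
for album, lyrics in songs:
    album_counts[album].update(tok for tok in lyrics.split() if tok not in JUNK)

# "Transposed" layout: one row per word, one column per album, values are counts.
vocab = sorted({word for counts in album_counts.values() for word in counts})
word_by_album = {
    word: {album: album_counts[album][word] for album in album_counts}
    for word in vocab
}

# Raw totals across all albums, for a "most used words" ranking.
overall = Counter()
for counts in album_counts.values():
    overall.update(counts)
top_words = overall.most_common(2)
```

A `Counter` returns 0 for a word an album never uses, so every row has a value for every album.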

Most Used Words

Top 20 most used words in all albums:

It is certainly no surprise that "love" is by far the most commonly occurring word across all albums. This is just the raw count of occurrences of a word (I didn't do anything to remove multiple occurrences of a word in a song or album), so I'm thinking a lot of this probably comes from "All You Need Is Love," in which the word "love" is repeated many times. We can test this theory if we break the top words down by album.

I would expect wide variation in word occurrences and most popular words between albums. The Beatles are known for going from more generic or traditional lyrics and songwriting in their early years to avant-garde and complex lyrics in their later albums, so I'm thinking that later albums will have more unusual popular words and earlier albums will have more common popular words. Let's chart the 10 most popular words in each album and compare:

Actually, the word "love" shows up the most in A Hard Day's Night, likely from the song "Can't Buy Me Love." For many albums, it looks like the top words are coming mostly from one song that has a lot of repeated words in a chorus. Some albums repeat their top words much more often than others. For instance, A Hard Day's Night has "love" repeated over 80 times, and Please Please Me has "love" repeated almost 60 times. On the flip side, the most used words in Revolver, Sgt. Pepper and Abbey Road were each used fewer than 30 times.

Very interestingly, this Rolling Stone poll of top Beatles albums put Revolver, Sgt. Pepper and Abbey Road as readers' highest ranked Beatles albums - the three albums whose top words are repeated least often. Perhaps fans prefer albums with songs that have more diverse lyrics.

On the whole, later albums do appear to have top words repeated fewer times than earlier albums. It's hard to say for sure if the top words in later albums are more unusual than the top words in earlier albums, because there are some pretty weird words coming up in earlier albums ("shuop"). It would be interesting to do a part of speech analysis on this in the future.

Another interesting thing to note is the change in popularity of the word "girl" throughout the albums. "Girl" was highly ranked in Please Please Me and Beatles for Sale, then first ranked in the following two albums, Help! and Rubber Soul. After Rubber Soul, "girl" didn't make it into the top words for any album again. This coincides with the Beatles' switch from writing albums and touring to experimenting and innovating in the studio. They were no longer interested in playing directly for fans, many of them young girls.

Tf-idf of most common 100 words

Another way to compare word popularity across albums is with tf-idf rather than simple count. Tf-idf penalizes words that are common across all albums, so using tf-idf rather than count will highlight words that are most uniquely popular for each album. I'm limiting my analysis to the top 100 words so that the results are easier to interpret.
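To make the penalty concrete, here is a small sketch of a textbook tf-idf, where tf is a word's share of an album's words and idf is ln(N / df). The album names and counts are invented for illustration. This is not necessarily the weighting used in this lab: scikit-learn's `TfidfVectorizer`, for example, defaults to a smoothed idf and L2-normalized rows, so its values would differ.

```python
import math
from collections import Counter

# Hypothetical per-album word counts (toy values, not from the dataset).
album_counts = {
    "Album A": Counter({"love": 6, "girl": 3, "yeah": 1}),
    "Album B": Counter({"love": 2, "submarine": 5, "yellow": 3}),
    "Album C": Counter({"love": 1, "girl": 1, "sun": 8}),
}

def tf_idf(album_counts):
    """Return {album: {word: tf * idf}} using tf = count / total, idf = ln(N / df)."""
    n_albums = len(album_counts)
    # Document frequency: the number of albums in which each word appears.
    doc_freq = Counter()
    for counts in album_counts.values():
        doc_freq.update(counts.keys())
    scores = {}
    for album, counts in album_counts.items():
        total = sum(counts.values())
        scores[album] = {
            word: (count / total) * math.log(n_albums / doc_freq[word])
            for word, count in counts.items()
        }
    return scores

scores = tf_idf(album_counts)
top_word_b = max(scores["Album B"], key=scores["Album B"].get)
```

In this toy example "love" appears in every album, so its idf is ln(3/3) = 0 and it scores zero everywhere, while "submarine" becomes the top word for the only album that uses it.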

See tf-idf of top 20 words overall for all albums

Ordering by word tf-idf for their first album.

Plotting tf-idf of top 100 words for album pairs

By plotting word tf-idf for one album on one axis and another album on another axis, we can get a sense of which popular words are shared between albums and which words are unique to an album.

With the Beatles vs. Yellow Submarine

Not very many words are in the middle of the graph, indicating that these two albums are not very similar in their vocabularies.

An interesting thing to note is that With the Beatles has a high-ish tf-idf for "wanna" and "want" while Yellow Submarine has a high-ish tf-idf for "need". A similar kind of meaning, but different words with different feelings.

Rubber Soul vs. Please Please Me

There are many more words in the center of the graph for these two albums compared to the previous two albums. We can see that these two albums are more similar in their vocabularies.

Correlation of words between albums:

We can formally, statistically assess word frequency correlations with a correlation table (graphed onto a heatmap). I'm going to compare correlations in word count for all words in the corpus.
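For reference, the quantity behind each cell of such a table can be sketched as a Pearson correlation between two albums' word-count vectors over a shared vocabulary. I am assuming Pearson here (it is the default in pandas' `DataFrame.corr`); the counts below are invented placeholders.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length count vectors.

    Assumes neither vector is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical counts; each position corresponds to the same word in `vocab`.
vocab = ["love", "girl", "yellow", "submarine"]
counts = {
    "Album A": [10, 5, 0, 0],
    "Album B": [8, 4, 1, 0],
    "Album C": [1, 0, 6, 6],
}

albums = list(counts)
# Full album-by-album correlation table as a nested dict.
corr = {a: {b: pearson(counts[a], counts[b]) for b in albums} for a in albums}
```

Albums that lean on the same words (A and B here) correlate strongly, while an album dominated by words nobody else uses (C) correlates weakly or negatively with the rest.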

There are a couple of albums that seem to be more correlated with other albums, meaning that they have more words in common with other albums and are potentially less unique in their lyrics. Please Please Me stands out as having higher correlations with several other albums than most other albums have. Rubber Soul also stands out as having a good number of higher correlations. Perhaps these two albums have more generic lyrics when considered against the entire Beatles catalogue.

Conversely, Yellow Submarine has notably lower correlations with many albums than other albums have with each other. This makes sense to me in light of the previous analysis that showed that two of Yellow Submarine's most popular words are "yellow" and "submarine." Clearly, those words are unlikely to be popular in other albums, so that would decrease the album's word correlation with other albums.

Lennon vs. McCartney Words

Next I want to take a look at relative frequencies of words used in songs credited only to John Lennon vs. songs credited only to Paul McCartney. I'm omitting any Lennon-McCartney songs.

There are roughly the same number of songs composed by John Lennon and by Paul McCartney in the dataset.

Creating a bag of words broken down by composer (Lennon or McCartney):

We can sort the dataframe to see top words for each:

Generating word frequencies for each

Getting relative frequencies:
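The text does not spell out how relative frequency is defined, so the following is only one plausible version: a smoothed log ratio of each word's share of composer A's words to its share of composer B's words. The function name, the add-one smoothing and the counts are all assumptions for illustration.

```python
import math
from collections import Counter

# Hypothetical word counts standing in for Lennon-only (a) and McCartney-only (b) songs.
counts_a = Counter({"girl": 6, "love": 10, "cry": 4})
counts_b = Counter({"love": 12, "home": 5, "girl": 1})

def relative_frequencies(a, b, smoothing=1.0):
    """Log ratio of smoothed word shares; positive means more typical of `a`."""
    vocab = set(a) | set(b)
    # Add-one style smoothing so words missing from one side do not divide by zero.
    total_a = sum(a.values()) + smoothing * len(vocab)
    total_b = sum(b.values()) + smoothing * len(vocab)
    return {
        word: math.log(((a[word] + smoothing) / total_a) / ((b[word] + smoothing) / total_b))
        for word in vocab
    }

scores = relative_frequencies(counts_a, counts_b)
# Most a-like words first, most b-like words last.
ranked = sorted(scores, key=scores.get, reverse=True)
```

Sorting by this score puts the words most characteristic of one composer at the top and those most characteristic of the other at the bottom, with shared words like "love" near zero.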

Part of speech tagging:

Getting Vader sentiment score for each word:

We can sort to see the most Lennon-like and McCartney-like words:

Overall, the most Lennon-like words are slightly more negative-leaning than the most McCartney-like words. None of the most McCartney-like words receive a sentiment score from the VADER sentiment package.

I see a couple differences in part of speech patterns between the top Lennon-like and McCartney-like words. There are twice as many verbs in the Lennon-like words. There are 3 adverbs in the McCartney-like word list, and no adverbs in the Lennon-like word list.

Subjectively, the Lennon-like words thematically seem to suggest more angst and longing while the McCartney-like words are more peppy or active (maybe?). This fits with the public perception of their personalities.

LDA to find topics across albums

Lastly, I'm going to run an LDA on the albums to see if I can pull out any distinct topics from the lyrics in each album. This analysis will show which topics are predominant for which album. We will also be able to see which words contribute most to each topic.

Because I'm not going to run a grid search here, I have to choose how many topics the LDA should fit. If I were seriously trying to optimize this, I would use a grid search to find the best-fitting number of topics.

For now, I am choosing just 2 topics for the LDA. This is based on my theory that there are potentially two major themes in the Beatles' album lyrics - mostly based on the difference between early vs. late Beatles.

See the dominant topic for each album:
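By "dominant topic" I mean the topic with the largest weight in an album's topic distribution. A minimal sketch of that step, using made-up topic weights rather than the fitted model's output:

```python
# Hypothetical album-by-topic weights, as an LDA fit might return (rows sum to 1).
doc_topic = {
    "Album A": [0.82, 0.18],
    "Album B": [0.35, 0.65],
    "Album C": [0.51, 0.49],
}

def dominant_topic(weights):
    """Return the index of the largest topic weight (first index on ties)."""
    return max(range(len(weights)), key=weights.__getitem__)

dominant = {album: dominant_topic(weights) for album, weights in doc_topic.items()}
print(dominant)
```

An album like "Album C" is labelled topic 0 even though its weights are nearly even, so the dominant-topic label hides how mixed an album is.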

See the most important words for each topic:

There are a similar number of albums falling into the two topics. It isn't a clean split between early and late albums, but the first topic seems to lean earlier than the second topic.

Looking at the most important words for each topic, they appear to be similar in a lot of ways with love being the top word for both, and know, see and come also important for both.

The main difference is that the first topic has girl as the second most important word, while the second topic doesn't have girl show up at all in the top 20 most important words. That is interesting in light of other analyses in this workbook that have uncovered patterns in the popularity and usage of "girl". We saw that "girl" was highly ranked for earlier albums and didn't show up as a top word again after Rubber Soul. We also saw that "girl" was a highly-ranked Lennon-like word.

I'm going to quickly chart the number of Lennon vs. McCartney composed songs in each album:

More songs were composed only by John Lennon in earlier albums. Later albums, for the most part, had more songs composed by McCartney. This pattern might help to explain why an album would be dominantly topic 0 or topic 1. Perhaps albums with more Lennon songs would have more topic 0 while albums with more McCartney songs would have more topic 1. We can check that:

2 of the 3 McCartney-dominated albums are categorized as topic 1, and 4 of the 7 Lennon-dominated albums are categorized as topic 0. Maybe there is some small effect here, with McCartney-dominated albums leaning more toward topic 1 and Lennon-dominated albums leaning more toward topic 0.

Categorizing songs by topic

Now I'm curious to look on a song level to see which composers have more songs from each topic:

Vectorizing by song:

Applying LDA topics to songs:
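Once every song has a dominant topic, comparing composers is a simple tally. Here is a sketch with placeholder composers and topic labels; none of these values come from the actual analysis.

```python
from collections import Counter

# Hypothetical (composer, dominant topic) pairs, one per song.
songs = [
    ("Composer A", 0), ("Composer A", 0), ("Composer A", 1),
    ("Composer B", 0), ("Composer B", 1),
    ("Composer C", 1), ("Composer C", 1), ("Composer C", 0),
]

# Count songs for each (composer, topic) combination.
tally = Counter(songs)
composers = sorted({composer for composer, _ in songs})

# Share of each composer's songs whose dominant topic is topic 0.
share_topic0 = {
    c: tally[(c, 0)] / (tally[(c, 0)] + tally[(c, 1)]) for c in composers
}
```

Comparing shares rather than raw counts keeps composers with many songs from dominating the comparison.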

Now considering all composers, we see that George Harrison is the only composer with more songs having topic 1 than topic 0. Paul McCartney and John Lennon have very similar shares of topics, though their combined Lennon/McCartney composed songs have notably more topic 0.

This somewhat refutes my theory that McCartney songs lean more heavily toward topic 1 while Lennon songs lean more heavily toward topic 0.